Abstract: A fast Discrete Cosine Transform (DCT) algorithm is
introduced that can be of particular interest in image processing.
The main features of the algorithm are the regularity of its graph
and very low arithmetic complexity. The 16-point version of the
algorithm requires only 32 multiplications and 81 additions. The
computational core of the algorithm consists of only 17 nontrivial
multiplications; the remaining 15 are scaling factors that can be
compensated in post-processing. The derivation of the algorithm is
based on the algebraic signal processing theory (ASP).

I. Introduction 

The Discrete Cosine Transform (DCT) has found many applications in
image processing, data compression and other fields due to its
decorrelation property [1]. Although a number of fast DCT
algorithms have already been proposed, designing new efficient
schemes is still of great interest [2]-[4]. The majority of the
proposed fast DCT algorithms have been obtained using graph
transformations, equivalence relations or sophisticated
manipulation of the transform coefficients. Recently an algebraic
approach to the derivation of fast DCT algorithms has been
presented [5]. The approach uses the polynomial algebra associated
with a DCT to obtain fast algorithms. Subsequently this theory has
been called algebraic signal processing theory (ASP) [6]. The
theory provides a consistent algebraic interpretation of fast DCT
algorithms.

The paper presents the derivation of a fast n-point DCT-2
algorithm (n is a power of two) based on ASP. The algorithm is
recursive and has a regular graph. Another feature of the algorithm
is its very low arithmetic complexity: the 16-point DCT requires
only 32 multiplications and 81 additions (only one multiplication
more than [7]), but the computational core of the algorithm
contains only 17 multiplications, while the other 15 are scaling
factors that can be compensated in post-processing. Because of
these properties the algorithm is a very attractive choice for
hardware DCT implementations.

II. Algebraic approach to DCT 

In this section the fundamentals of algebraic signal processing
theory [6] are considered that are used further for the derivation
of the fast DCT-2 algorithm.



A. Background: polynomial algebras 

A polynomial algebra is a vector space over the field $\mathbb{F}$
denoted as

$$\mathcal{A} = \mathbb{F}[x]/p(x). \tag{1}$$

The elements of the algebra are all polynomials in $x$ over
$\mathbb{F}$ of degree smaller than $\deg(p) = n$. $\mathcal{A}$ is
equipped with the operations of usual polynomial addition and
multiplication modulo the polynomial $p(x)$.

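Using the Chinese remainder theorem (CRT) a polynomial algebra (1)
can be decomposed into a direct sum of one-dimensional subalgebras

$$\mathcal{F}: \mathbb{F}[x]/p(x) \to \bigoplus_{0 \le k < n} \mathbb{F}[x]/(x - \alpha_k), \tag{2}$$

provided that the zeros $\alpha = (\alpha_0, \dots, \alpha_{n-1})$
of $p(x)$ are pairwise distinct and $\alpha_k \in \mathbb{F}$. The
mapping $\mathcal{F}$ is represented in matrix form as

$$\mathcal{F} = \mathcal{P}_{b,\alpha} = [p_\ell(\alpha_k)]_{0 \le k,\ell < n}, \tag{3}$$

if a basis $b = (p_0, \dots, p_{n-1})$ is set in
$\mathbb{F}[x]/p(x)$ and the unit basis $(x^0) = (1)$ is chosen in
each $\mathbb{F}[x]/(x - \alpha_k)$. $\mathcal{P}_{b,\alpha}$ is
referred to as the polynomial transform for $\mathcal{A}$ with
basis $b$ [6]. A scaled polynomial transform is obtained for a
different basis $\beta_k$ in each $\mathbb{F}[x]/(x - \alpha_k)$:

$$\mathcal{F} = \operatorname{diag}(1/\beta_0, \dots, 1/\beta_{n-1}) \cdot \mathcal{P}_{b,\alpha}. \tag{4}$$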
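As an illustration of (2)-(3), the polynomial transform can be read as evaluation of the basis polynomials at the zeros of $p(x)$, because reducing a polynomial modulo $(x - \alpha_k)$ yields its value at $\alpha_k$. The sketch below is not taken from the paper: it assumes a hypothetical cubic $p(x) = (x-1)(x-2)(x-3)$ over the rationals with the monomial basis, and the helper names `poly_eval` and `polynomial_transform` are introduced only for this example.

```python
from fractions import Fraction

def poly_eval(coeffs, x):
    """Evaluate a polynomial (coefficients listed lowest degree first) by Horner's rule."""
    acc = Fraction(0)
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def polynomial_transform(basis, zeros):
    """Return the matrix [p_l(alpha_k)] with row index k and column index l, as in (3)."""
    return [[poly_eval(p, a) for p in basis] for a in zeros]

# Assumed example: p(x) = (x - 1)(x - 2)(x - 3) with monomial basis (1, x, x^2).
zeros = [Fraction(1), Fraction(2), Fraction(3)]
basis = [[1], [0, 1], [0, 0, 1]]
P = polynomial_transform(basis, zeros)

# s(x) = 4 + 5x + 6x^2 has coordinate vector (4, 5, 6) in this basis.
s = [4, 5, 6]
# Each output is the residue of s(x) modulo (x - alpha_k), i.e. the value s(alpha_k).
spectrum = [sum(row[l] * s[l] for l in range(len(s))) for row in P]
```

With the monomial basis the matrix is a Vandermonde matrix; choosing Chebyshev polynomials as the basis instead is what produces the DCT matrices discussed later.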



B. Derivation of fast transform algorithms in ASP 

In ASP, transforms are represented as matrix-vector products

$$y = Tx, \quad \text{where } T = [t_{k,i}]_{0 \le k,i < n}.$$

A fast transform algorithm is viewed as a factorization of $T$ into
a product of sparse structured matrices. This approach has
advantages from an algorithmic point of view: it reveals the
structure of the algorithm and simplifies manipulating it to derive
new variants.

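In the paper the following basic matrices are used: the identity
matrix $I_n$ and the reversed identity matrix $J_n$,

$$I_n = \begin{bmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{bmatrix}, \qquad
J_n = \begin{bmatrix} & & 1 \\ & \cdot^{\,\cdot^{\,\cdot}} & \\ 1 & & \end{bmatrix}.$$

A permutation matrix, which has exactly one entry 1 in each row and
each column and zeros elsewhere, is defined by the position $f(i)$
of the entry 1 in row $i$:

$$P: i \mapsto f(i), \quad 0 \le i < n.$$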

One important example is the $n \times n$ stride permutation matrix
defined for $m \mid n$ as

$$L^n_m: i_2 \frac{n}{m} + i_1 \mapsto i_1 m + i_2$$

for $0 \le i_1 < \frac{n}{m}$, $0 \le i_2 < m$.
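The index map of the stride permutation can be sketched in a few lines. This is only an illustration under the convention stated above (row $i$ of a permutation matrix has its 1 at position $f(i)$, so the output satisfies `y[i] = x[f[i]]`); the function names are chosen for this example.

```python
def stride_permutation(n, m):
    """Return f with f[i2 * (n // m) + i1] = i1 * m + i2, the index map of L^n_m."""
    if n % m != 0:
        raise ValueError("m must divide n")
    q = n // m
    f = [0] * n
    for i1 in range(q):
        for i2 in range(m):
            f[i2 * q + i1] = i1 * m + i2
    return f

def apply_permutation(f, x):
    """Apply the permutation matrix with a 1 in row i at column f[i]: y[i] = x[f[i]]."""
    return [x[j] for j in f]

# L^8_4 interleaves the first and second halves of the input.
f = stride_permutation(8, 4)
y = apply_permutation(f, list("abcdefgh"))
```

For $L^{2n}_n$ this interleaving places the entries of the first half at even positions and those of the second half at odd positions.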

ASP states that every DCT corresponds to some polynomial algebra
$\mathbb{F}[x]/p(x)$ with basis $b$. In this case the DCT is given
by the CRT (2) and its matrix takes the form of a polynomial
transform (3) or a scaled polynomial transform (4). From (2) it can
be seen that $\mathcal{F}$ decomposes $\mathbb{F}[x]/p(x)$ into
one-dimensional polynomial algebras. A fast algorithm is obtained
by performing this decomposition in steps using intermediate
subalgebras.

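One possible way to perform the decomposition of
$\mathbb{F}[x]/p(x)$ in steps is to use a factorization
$p(x) = q(x) \cdot r(x)$. If $\deg(q) = k$ and $\deg(r) = m$ then

$$\begin{aligned}
\mathbb{F}[x]/p(x) &\to \mathbb{F}[x]/q(x) \oplus \mathbb{F}[x]/r(x) && (5)\\
&\to \bigoplus_{0 \le i < k} \mathbb{F}[x]/(x - \beta_i) \oplus \bigoplus_{0 \le j < m} \mathbb{F}[x]/(x - \gamma_j) && (6)\\
&\to \bigoplus_{0 \le i < n} \mathbb{F}[x]/(x - \alpha_i), && (7)
\end{aligned}$$

where $\beta_i$ and $\gamma_j$ are the zeros of $q(x)$ and $r(x)$,
respectively. If $c$ and $d$ are the bases of $\mathbb{F}[x]/q(x)$
and $\mathbb{F}[x]/r(x)$, respectively, then (5)-(7) are expressed
in the following matrix form [6]:

$$\mathcal{P}_{b,\alpha} = P(\mathcal{P}_{c,\beta} \oplus \mathcal{P}_{d,\gamma})B, \tag{8}$$

where $A \oplus B = \begin{bmatrix} A & \\ & B \end{bmatrix}$
denotes the direct sum of matrices. Step (5) corresponds to the
change of basis matrix $B$, which maps the basis $b$ to the
concatenation $(c, d)$. Step (6) uses the CRT to decompose
$\mathbb{F}[x]/q(x)$ and $\mathbb{F}[x]/r(x)$; it corresponds to
the direct sum of the matrices $\mathcal{P}_{c,\beta}$ and
$\mathcal{P}_{d,\gamma}$. Finally, the permutation matrix $P$ maps
the concatenation $(\beta, \gamma)$ to the ordered list of zeros
$\alpha$ in (7). Provided that $B$ is sparse, (8) leads to a fast
algorithm.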

C. Polynomial algebras for DCT-2 and DCT-4 

This subsection introduces the polynomial algebras connected with
DCT-4 and DCT-2. Let us first consider the polynomial algebra
associated with $\text{DCT-4}_n$,

$$\mathcal{A} = \mathbb{F}[x]/2T_n(x), \quad b = (V_0, \dots, V_{n-1}), \tag{9}$$

where $T$ and $V$ are Chebyshev polynomials of the first and third
kind, respectively. These Chebyshev polynomials have the following
closed form expressions ($\cos\theta = x$):

$$T_n(x) = \cos n\theta, \qquad V_n(x) = \frac{\cos(n + \frac{1}{2})\theta}{\cos\frac{1}{2}\theta}.$$

$\alpha_k = \cos(k + \frac{1}{2})\frac{\pi}{n}$, $0 \le k < n$, are
the zeros of $2T_n(x)$. In accordance with (3), the polynomial
transform for algebra (9) is defined as



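$$\mathcal{P}_{b,\alpha} = [V_\ell(\alpha_k)]_{0 \le k,\ell < n} = \left[\frac{\cos(k + \frac{1}{2})(\ell + \frac{1}{2})\frac{\pi}{n}}{\cos(k + \frac{1}{2})\frac{\pi}{2n}}\right]_{0 \le k,\ell < n}. \tag{10}$$

In order to get the matrix of $\text{DCT-4}_n$, (10) is multiplied
from the left by the scaling diagonal matrix

$$D_n^{(C4)} = \operatorname{diag}_{0 \le k < n}\left(\cos(k + \tfrac{1}{2})\tfrac{\pi}{2n}\right),$$

which yields

$$\text{DCT-4}_n = \left[\cos(k + \tfrac{1}{2})(\ell + \tfrac{1}{2})\tfrac{\pi}{n}\right]_{0 \le k,\ell < n}. \tag{11}$$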

Eqs. (10)-(11) show that DCT-4 is a scaled polynomial transform of
the form (4) for the specified polynomial algebra (9).
$\text{DCT-2}_n$ arises from the polynomial algebra

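$$\mathcal{A} = \mathbb{F}[x]/(x - 1)U_{n-1}(x), \quad b = (V_0, \dots, V_{n-1}), \tag{12}$$

where $U$ is the Chebyshev polynomial of the second kind, which can
be written as ($\cos\theta = x$):

$$U_n(x) = \frac{\sin(n + 1)\theta}{\sin\theta}.$$

Since the zeros of $U_n(x)$ are given by
$\alpha_k = \cos\frac{(k+1)\pi}{n+1}$, the zeros of
$(x - 1)U_{n-1}(x)$ are $\alpha_k = \cos\frac{k\pi}{n}$,
$0 \le k < n$, and the polynomial transform for (12) takes the form

$$[V_\ell(\alpha_k)]_{0 \le k,\ell < n} = \left[\frac{\cos k(\ell + \frac{1}{2})\frac{\pi}{n}}{\cos\frac{k\pi}{2n}}\right]_{0 \le k,\ell < n}. \tag{13}$$

To obtain the DCT-2 matrix, (13) needs to be multiplied from the
left by the scaling diagonal

$$D_n^{(C2)} = \operatorname{diag}_{0 \le k < n}\left(\cos\tfrac{k\pi}{2n}\right).$$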

The polynomial transform corresponding to a discrete trigonometric
transform (DTT) is denoted as $\overline{\text{DTT}}$; for instance,
$\overline{\text{DCT-4}}_n$ stands for the matrix in (10).
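The relation between the polynomial transforms and the usual cosine matrices can be checked numerically. The sketch below is only a sanity check under stated assumptions: it uses floating-point arithmetic, the size $n = 8$ is arbitrary, the values $V_\ell(\alpha_k)$ are generated with the three-term recurrence $V_\ell = 2xV_{\ell-1} - V_{\ell-2}$ starting from $V_0 = 1$, $V_1 = 2x - 1$, and all function names are introduced for this example only.

```python
import math

def cheb_v(count, x):
    """Return [V_0(x), ..., V_{count-1}(x)] via V_0 = 1, V_1 = 2x - 1, V_l = 2x V_{l-1} - V_{l-2}."""
    vals = [1.0, 2.0 * x - 1.0]
    while len(vals) < count:
        vals.append(2.0 * x * vals[-1] - vals[-2])
    return vals[:count]

def scaled_transform(zeros, scales):
    """Multiply the polynomial transform [V_l(alpha_k)] from the left by diag(scales)."""
    n = len(zeros)
    return [[scales[k] * v for v in cheb_v(n, zeros[k])] for k in range(n)]

def max_abs_diff(a, b):
    """Largest entrywise difference between two matrices given as lists of rows."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

n = 8

# DCT-4: zeros of 2 T_n and the scaling cos((k + 1/2) pi / (2n)).
z4 = [math.cos((k + 0.5) * math.pi / n) for k in range(n)]
d4 = [math.cos((k + 0.5) * math.pi / (2 * n)) for k in range(n)]
dct4 = [[math.cos((k + 0.5) * (l + 0.5) * math.pi / n) for l in range(n)] for k in range(n)]
err4 = max_abs_diff(scaled_transform(z4, d4), dct4)

# DCT-2: zeros of (x - 1) U_{n-1} and the scaling cos(k pi / (2n)).
z2 = [math.cos(k * math.pi / n) for k in range(n)]
d2 = [math.cos(k * math.pi / (2 * n)) for k in range(n)]
dct2 = [[math.cos(k * (l + 0.5) * math.pi / n) for l in range(n)] for k in range(n)]
err2 = max_abs_diff(scaled_transform(z2, d2), dct2)
```

Both `err4` and `err2` should be at the level of floating-point rounding, which is consistent with DCT-4 and DCT-2 being scaled polynomial transforms of the form (4).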

In what follows we need the skew $\text{DCT-4}(r)$. This transform
was introduced in [6] because it appears as an important building
block of Cooley-Tukey type algorithms for the DCT. The skew
$\text{DCT-4}(r)$ is associated with the polynomial algebra

$$\mathcal{A} = \mathbb{F}[x]/(2T_n(x) - 2\cos r\pi)$$

with the basis $b = (V_0, \dots, V_{n-1})$, where $0 < r < 1$. The
conventional $\text{DCT-4}_n$ is the special case of the skew
$\text{DCT-4}_n(r)$ for $r = 1/2$.

III. Derivation of fast $\text{DCT-2}_{2^k}$ algorithm

In this section the algebraic derivation of the fast
$\text{DCT-2}_{2^k}$ algorithm is given in detail. According to
(12) the polynomial algebra corresponding to $\text{DCT-2}_{2^k}$
is given by

$$\mathcal{A} = \mathbb{F}[x]/(x - 1)U_{2^k-1}(x), \quad b = (V_0, \dots, V_{2^k-1}).$$

An important issue is the choice of the base field $\mathbb{F}$.
Since the Chebyshev polynomials $V$ and $U$ in definition (12) have
integer coefficients (for example $V_2(x) = 4x^2 - 2x - 1$), the
base field $\mathbb{F}$ is set to the field of rational numbers
$\mathbb{Q}$. The field is extended during the factorization of the
polynomial $U_{2^k-1}(x)$, since this polynomial does not factor
completely over $\mathbb{Q}$.

It is well known [1] that a fast $\text{DCT-2}_{2n}$ algorithm can
be reduced to fast $\text{DCT-2}_n$ and $\text{DCT-4}_n$
algorithms. Using the factorization of the Chebyshev polynomial of
the second kind

$$U_{2n-1}(x) = U_{n-1}(x) \cdot 2T_n(x),$$

the algebra $\mathbb{Q}[x]/(x - 1)U_{2n-1}(x)$ with basis
$b = (V_0, \dots, V_{2n-1})$ can be decomposed as

$$\mathbb{Q}[x]/(x - 1)U_{2n-1}(x) \to \mathbb{Q}[x]/(x - 1)U_{n-1}(x) \oplus \mathbb{Q}[x]/2T_n(x), \tag{14}$$

which according to (5)-(7) leads to the following fast algorithm
[6]:



$$\overline{\text{DCT-2}}_{2n} = L^{2n}_n(\overline{\text{DCT-2}}_n \oplus \overline{\text{DCT-4}}_n)B_{2n}, \tag{15}$$

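where $L^{2n}_n$ is the stride permutation matrix and $B_{2n}$ is
the change of basis matrix. $B_{2n}$ maps the basis $b$ to the
concatenation $(c, d)$, where $c = d = (V_0, \dots, V_{n-1})$ are
the bases of the subalgebras on the right-hand side of (14). The
first $n$ columns of $B_{2n}$ are

$$\begin{bmatrix} I_n \\ I_n \end{bmatrix},$$

since the elements $V_\ell \in b$ for $0 \le \ell < n$ are already
contained in $c$ and $d$. The remaining entries are determined by
the following expressions:

$$V_{n+\ell} \equiv V_{n-\ell-1} \mod (x - 1)U_{n-1}, \tag{16}$$

$$V_{n+\ell} \equiv -V_{n-\ell-1} \mod 2T_n, \tag{17}$$

which yields

$$B_{2n} = \begin{bmatrix} I_n & J_n \\ I_n & -J_n \end{bmatrix}.$$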

(16)-(17) can be derived using the relations
$2T_n = V_n + V_{n-1}$, $2(x - 1)U_{n-1} = V_n - V_{n-1}$ and
$V_n = 2xV_{n-1} - V_{n-2}$. Note that decomposition (14) does not
require an extension of the base field $\mathbb{Q}$. This leads to
a multiplication-free change of basis matrix $B_{2n}$.
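One step of factorization (15) can be checked numerically against the direct polynomial transform. The sketch below is an illustration rather than the authors' implementation: it assumes $B_{2n} = \begin{bmatrix} I_n & J_n \\ I_n & -J_n \end{bmatrix}$ as implied by (16)-(17), evaluates the two half-size polynomial transforms by plain matrix-vector products instead of recursing, and uses function names and test data invented for this example.

```python
import math

def cheb_v(count, x):
    """Return [V_0(x), ..., V_{count-1}(x)] via the recurrence V_l = 2x V_{l-1} - V_{l-2}."""
    vals = [1.0, 2.0 * x - 1.0]
    while len(vals) < count:
        vals.append(2.0 * x * vals[-1] - vals[-2])
    return vals[:count]

def dct2_bar(n):
    """Polynomial transform (13): rows V_l(cos(k pi / n)), k = 0..n-1."""
    return [cheb_v(n, math.cos(k * math.pi / n)) for k in range(n)]

def dct4_bar(n):
    """Polynomial transform (10): rows V_l(cos((k + 1/2) pi / n)), k = 0..n-1."""
    return [cheb_v(n, math.cos((k + 0.5) * math.pi / n)) for k in range(n)]

def matvec(a, x):
    return [sum(c * v for c, v in zip(row, x)) for row in a]

def dct2_bar_one_step(x):
    """Apply L^{2n}_n (DCT2bar_n (+) DCT4bar_n) B_{2n} to x, as in (15)."""
    n = len(x) // 2
    lo, rev = x[:n], x[n:][::-1]           # rev = J_n applied to the second half
    u = [a + b for a, b in zip(lo, rev)]   # first block row of B_{2n}: I_n, J_n
    v = [a - b for a, b in zip(lo, rev)]   # second block row of B_{2n}: I_n, -J_n
    y = [0.0] * (2 * n)
    y[0::2] = matvec(dct2_bar(n), u)       # L^{2n}_n sends the first half to even positions
    y[1::2] = matvec(dct4_bar(n), v)       # and the second half to odd positions
    return y

x = [1.0, -2.0, 3.5, 0.25, -1.0, 4.0, 2.0, -0.5]
direct = matvec(dct2_bar(len(x)), x)
err = max(abs(a - b) for a, b in zip(direct, dct2_bar_one_step(x)))
```

The change of basis costs only $2n$ additions and no multiplications, which is why all nontrivial multiplications end up in the DCT-4 part of the recursion.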



When the size of the DCT-2 is a power of two, (15) can be applied
recursively to obtain a fast algorithm. Thus, the problem of
deriving a fast $\text{DCT-2}_{2^k}$ algorithm reduces to deriving
a fast $\text{DCT-4}_{2^{k-1}}$ algorithm. From the ASP point of
view the question is how to factor the polynomial $2T_n$ (when $n$
is a power of 2) in steps. We propose to use the following general
recursive formula

$$2T_{2n}(x) - 2\cos r\pi = \left(2T_n(x) - 2\cos\tfrac{r\pi}{2}\right) \times \left(2T_n(x) - 2\cos\pi(1 - \tfrac{r}{2})\right), \tag{18}$$

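which can be proved using the closed form of $T_{2n}$; the
parameter $r \in (0, 1)$. The special case of (18) for $r = 1/2$
specifies the factorization of $2T_{2n}$. Using (18), the
polynomial algebra related to $\overline{\text{DCT-4}}_{2n}(r)$ is
decomposed as

$$\mathbb{Q}_{\cos r\pi}[x]/(2T_{2n}(x) - 2\cos r\pi) \to \mathbb{Q}_{\cos\frac{r\pi}{2}}[x]/(2T_n(x) - 2\cos\tfrac{r\pi}{2}) \oplus \mathbb{Q}_{\cos\frac{r\pi}{2}}[x]/(2T_n(x) - 2\cos\pi(1 - \tfrac{r}{2})). \tag{19}$$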
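Factorization (18) follows from $T_{2n} = 2T_n^2 - 1$ and $\cos\pi(1 - \frac{r}{2}) = -\cos\frac{r\pi}{2}$, and it can also be spot-checked numerically. The sketch below is only such a check: the sample sizes, parameters and evaluation points are arbitrary choices made for this example, and $T_n$ is computed with the standard recurrence $T_n = 2xT_{n-1} - T_{n-2}$.

```python
import math

def cheb_t(n, x):
    """T_n(x) via T_0 = 1, T_1 = x, T_n = 2x T_{n-1} - T_{n-2}."""
    a, b = 1.0, x
    if n == 0:
        return a
    for _ in range(n - 1):
        a, b = b, 2.0 * x * b - a
    return b

def lhs(n, r, x):
    """Left-hand side of (18): 2 T_{2n}(x) - 2 cos(r pi)."""
    return 2.0 * cheb_t(2 * n, x) - 2.0 * math.cos(r * math.pi)

def rhs(n, r, x):
    """Right-hand side of (18): product of the two half-degree factors."""
    t = 2.0 * cheb_t(n, x)
    return (t - 2.0 * math.cos(r * math.pi / 2.0)) * (t - 2.0 * math.cos(math.pi * (1.0 - r / 2.0)))

# Largest deviation over a small grid of sizes n, parameters r and points x in [-1, 1].
worst = max(
    abs(lhs(n, r, x) - rhs(n, r, x))
    for n in (1, 2, 4, 8)
    for r in (0.25, 0.5, 0.75)
    for x in (-0.9, -0.3, 0.1, 0.7, 1.0)
)
```

For $r = 1/2$ the constant term on the left vanishes, so the same identity splits $2T_{2n}$ itself, which is the case needed for the conventional DCT-4.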

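The decomposition leads to the following fast algorithm

$$\overline{\text{DCT-4}}_{2n}(r) = P \cdot \left(\overline{\text{DCT-4}}_n(\tfrac{r}{2}) \oplus \overline{\text{DCT-4}}_n(1 - \tfrac{r}{2})\right) \cdot B^{(C4)}_{2n}(r), \tag{20}$$

where $P$ is a permutation matrix that maps the concatenated zeros
of the two factors to the ordered list of zeros, as in (8), and
$B^{(C4)}_{2m}(r)$, with $m = n$ here, is the change of basis
matrix

$$B^{(C4)}_{2m}(r) = \begin{bmatrix} I_m & (2\cos\frac{r\pi}{2}I_m - J_m) \\ I_m & (-2\cos\frac{r\pi}{2}I_m - J_m) \end{bmatrix}
= \begin{bmatrix} I_m & I_m \\ I_m & -I_m \end{bmatrix} \begin{bmatrix} I_m & -J_m \\ & 2\cos\frac{r\pi}{2}I_m \end{bmatrix}, \tag{21}$$

which is determined by

$$V_{n+\ell} \equiv -V_{n-\ell-1} + 2\cos\tfrac{r\pi}{2}V_\ell \mod 2T_n - 2\cos\tfrac{r\pi}{2},$$

$$V_{n+\ell} \equiv -V_{n-\ell-1} - 2\cos\tfrac{r\pi}{2}V_\ell \mod 2T_n - 2\cos\pi(1 - \tfrac{r}{2}).$$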



Decomposition (19) requires an extension of the base field
$\mathbb{Q}_{\cos r\pi}$ to $\mathbb{Q}_{\cos\frac{r\pi}{2}}$. The
new elements of the field appear in the matrix
$B^{(C4)}_{2n}(r)$.



Joint use of factorizations (15) and (20) leads to the new fast
recursive $\text{DCT-2}_{2^k}$ algorithm. The basic operation of
the algorithm is multiplication by the matrix $B^{(C4)}_{2m}(r)$.
All nontrivial multiplications are concentrated in it, which makes
it very similar to the butterfly operation in the FFT algorithm.

IV. Fast $\text{DCT-2}_{16}$ algorithm

In this section the proposed approach is applied to the derivation
of a fast $\text{DCT-2}_{16}$ algorithm. First the transform is
expressed as a product

$$\text{DCT-2}_{16} = D^{(C2)}_{16} \cdot \overline{\text{DCT-2}}_{16}. \tag{22}$$

1 Here $\mathbb{Q}_{\cos r\pi}$ is used as a short notation for the
field extension $\mathbb{Q}[\cos r\pi]$.



Then factorizations (15) and (20) are applied recursively to obtain
the fast transform algorithm. The flow graph of this algorithm is
shown in Fig. 1 (for simplicity the scaling of the output is
omitted). Fig. 2 explains the basic building block (BB) of the
algorithm, which performs the multiplication by matrix (21). All
operations inside the BB are performed on input vectors of $m$
components. Evaluation of one BB requires $3m$ additions and $m$
multiplications.
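The stated cost of the building block can be made concrete with a short sketch. It is an illustration under an explicit assumption: matrix (21) is taken to have the block form $\begin{bmatrix} I_m & cI_m - J_m \\ I_m & -cI_m - J_m \end{bmatrix}$ with $c = 2\cos\frac{r\pi}{2}$, which is what the congruences of Section III imply. The function names and the dense reference routine are introduced only for this example.

```python
import math

def building_block(x, r):
    """Apply the assumed matrix (21) to x (length 2m) with m multiplications and 3m additions."""
    m = len(x) // 2
    c = 2.0 * math.cos(r * math.pi / 2.0)
    lo, hi = x[:m], x[m:]
    t = [lo[i] - hi[m - 1 - i] for i in range(m)]   # m additions: lo - J_m hi
    s = [c * hi[i] for i in range(m)]               # m multiplications: c * hi
    top = [t[i] + s[i] for i in range(m)]           # m additions
    bottom = [t[i] - s[i] for i in range(m)]        # m additions
    return top + bottom

def building_block_dense(x, r):
    """Reference: build [[I, cI - J], [I, -cI - J]] explicitly and multiply."""
    m = len(x) // 2
    c = 2.0 * math.cos(r * math.pi / 2.0)
    mat = [[0.0] * (2 * m) for _ in range(2 * m)]
    for i in range(m):
        mat[i][i] += 1.0
        mat[i][m + i] += c
        mat[i][2 * m - 1 - i] -= 1.0
        mat[m + i][i] += 1.0
        mat[m + i][m + i] -= c
        mat[m + i][2 * m - 1 - i] -= 1.0
    return [sum(a * b for a, b in zip(row, x)) for row in mat]

x = [1.0, 2.0, 3.0, 4.0, -1.0, 0.5]
fast = building_block(x, 0.5)
dense = building_block_dense(x, 0.5)
bb_err = max(abs(a - b) for a, b in zip(fast, dense))
```

Splitting the matrix into a sparse scaling stage followed by a plain add/subtract stage is what keeps the multiplication count at $m$ instead of the $2m$ suggested by the unfactored block form.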

The presented 16-point DCT-2 algorithm uses 32 multiplications and
81 additions. However, only 17 multiplications constitute the core
of the algorithm, while the other 15 are scaling factors that can
be compensated in post-processing. Fig. 1 also shows that the
algorithm includes the computation of an 8-point DCT-2 that
requires only 5 multiplications. This is the same result as in [8].
In fact, the proposed algorithm can be considered a generalization
of Arai's DCT algorithm, since the resulting computational scheme
has very low multiplicative complexity and scaled outputs.
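The operation counts quoted above can be reproduced from the recursive structure. The sketch below is a bookkeeping exercise under stated assumptions, not a count taken from the flow graph: each application of (15) at size $2n$ is charged $2n$ additions for $B_{2n}$, each application of (20) at size $2m$ is charged one building block ($m$ multiplications, $3m$ additions), permutations are free, size-1 transforms cost nothing, and the output scaling contributes one multiplication per output except the first, whose factor is $\cos 0 = 1$.

```python
def dct4_core_cost(n):
    """(multiplications, additions) for the skew DCT-4 polynomial transform of size n via (20)."""
    if n == 1:
        return (0, 0)
    m = n // 2
    mul, add = dct4_core_cost(m)
    # Two half-size skew DCT-4 transforms plus one building block on 2m inputs.
    return (2 * mul + m, 2 * add + 3 * m)

def dct2_core_cost(n):
    """(multiplications, additions) for the DCT-2 polynomial transform of size n via (15)."""
    if n == 1:
        return (0, 0)
    h = n // 2
    m2, a2 = dct2_core_cost(h)
    m4, a4 = dct4_core_cost(h)
    # The change of basis matrix of size n costs n additions and no multiplications.
    return (m2 + m4, a2 + a4 + n)

def dct2_total_cost(n):
    """Core cost plus n - 1 scaling multiplications for the diagonal in (22)."""
    mul, add = dct2_core_cost(n)
    return (mul + n - 1, add)

core16 = dct2_core_cost(16)    # -> (17, 81)
total16 = dct2_total_cost(16)  # -> (32, 81)
core8 = dct2_core_cost(8)      # -> (5, 29)
```

Under these assumptions the recurrences give 17 core multiplications, 81 additions and 32 multiplications in total for 16 points, and 5 core multiplications for 8 points, matching the figures stated in the text.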



Fig. 1. 16-point DCT algorithm (flow graph containing the
$\text{DCT-2}_8$ and $\text{DCT-4}_8$ sub-blocks)

Fig. 2. Building block of the fast DCT algorithm



V. Conclusion 

A fast $2^k$-point DCT-2 algorithm based on ASP is presented. The
key features of the algorithm are the regularity of the graph
($\text{DCT-2}_{n/2}$ is available inside a $\text{DCT-2}_n$) and
very low
arithmetic complexity (the computational core of the
$\text{DCT-2}_{2^k}$ algorithm contains only
$\sum_{n=1}^{k-1} n2^{n-1}$ multiplications). Regular
graph of the proposed algorithm is well suited for the development
of new parallel-pipeline architectures of DCT processors. Also, the
algorithm extends the existing space of alternative fast DCT
algorithms. It can be used by automatic code generation programs
that search among alternative implementations of the same transform
to find the one that is best tuned to the desired platform [9],
[10].